Abstract Rooted phylogenetic networks are often constructed by combining trees, clusters, 
triplets or characters into a single network that in some well-defined sense simultaneously 
represents them all. We review these four models and investigate how they are related. In 
general, the model chosen influences the minimum number of reticulation events required. 
However, when one obtains the input data from two binary trees, we show that the min- 
imum number of reticulations is independent of the model. The number of reticulations 
necessary to represent the trees, triplets, clusters (in the softwired sense) and characters 
(with unrestricted multiple crossover recombination) are all equal. Furthermore, we show 
that these results also hold when not the number of reticulations but the level of the con- 
structed network is minimised. We use these unification results to settle several complexity 
questions that have been open in the field for some time. We also give explicit examples to 
show that already for data obtained from three binary trees the models begin to diverge. 



1 Introduction 



Consider a set of taxa X. A rooted phylogenetic network on X is a rooted directed acyclic graph 
in which the outdegree-zero nodes (the leaves) are bijectively labelled by X. It is common to 
identify a leaf with the taxon it is labelled by and it is usually assumed that there are no nodes 
with indegree and outdegree one; we adopt both conventions. Nodes with indegree at least two are 
called reticulations. The edges entering a reticulation are called reticulation edges. Nodes that are 
not reticulations are called tree nodes. A phylogenetic network is called binary if all reticulations 
have indegree two and outdegree one and all other nodes have outdegree zero or two. 

One of the main challenges in phylogenetics is to reconstruct phylogenetic networks from biolo- 
gical data of currently living organisms. The reticulations in a phylogenetic network are of special 
biological interest. These nodes represent "reticulate" evolutionary phenomena like hybridisation, 
recombination or lateral (horizontal) gene transfer. Motivated by the parsimony principle, a phylo- 
genetic network with fewer reticulations is often preferred over a network with more reticulations, 
when both networks represent the available data equally well. 

Thus, we define the following fundamental problem MinRet. Given some set $\mathcal{D}$ of data describing 
some set X of taxa, find a phylogenetic network on X that "represents" $\mathcal{D}$ and contains 
a minimum number of reticulations over all phylogenetic networks on X representing $\mathcal{D}$. We 
consider three specific variants of this problem: MinRetTrees, MinRetTriplets and 
MinRetClusters, for data $\mathcal{D}$ consisting of trees, triplets and clusters respectively. 

The following subtlety has to be taken into account when reticulations with indegree higher 
than two are considered. When counting such reticulations, indegree-d reticulations are counted 
d - 1 times, because such reticulations represent d - 1 reticulate evolutionary events (of which the 
order is not specified). Hence, using $\delta^-(v)$ to denote the indegree of a node v, we formally define 
the number of reticulations in a phylogenetic network N = (V, E) as 

\[ \sum_{v \in V : \delta^-(v) > 0} (\delta^-(v) - 1) = |E| - |V| + 1 . \]
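The two sides of this identity can be checked on a small example. The following is a minimal sketch, assuming the network is given as a list of directed edges with a single root; the function names and the example network are hypothetical and chosen only for illustration.

```python
from collections import Counter

def reticulation_number(edges):
    """Sum of (indegree - 1) over all nodes with indegree > 0."""
    indegree = Counter(head for _, head in edges)
    return sum(d - 1 for d in indegree.values())

def reticulation_number_by_counting(edges):
    """|E| - |V| + 1, assuming the network has exactly one root."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) - len(nodes) + 1

# Hypothetical network: root r and one indegree-3 reticulation h with leaf child z.
EXAMPLE_EDGES = [
    ("r", "a"), ("r", "b"), ("a", "x"), ("a", "c"), ("c", "h"),
    ("c", "y"), ("b", "h"), ("b", "d"), ("d", "h"), ("d", "w"), ("h", "z"),
]
```

In this example the single indegree-3 reticulation is counted twice, so both functions return 2.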



Instead of minimizing the total number of reticulations in a network, another possibility is 
to minimize the number of reticulations in each nontrivial biconnected component (informally: 
tangled part) of a network. Formally, a biconnected component is a maximal subgraph that cannot 
be disconnected by removing a single node. A biconnected component is trivial if it is equal to a 
single edge and nontrivial otherwise. For $k \in \mathbb{N}$, a phylogenetic network is called a level-k network 
if each nontrivial biconnected component contains at most k reticulations. See Figure 1 for an 
example of a phylogenetic network with four reticulations. This is a level-3 network, because each 
nontrivial biconnected component contains at most three reticulations. 

Figure 1. A level-3 phylogenetic network with four reticulations. Nontrivial biconnected components 
are encircled by dashed lines. Reticulations are unfilled and colored red. Reticulation edges 
are also indicated in red. 
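The level of a network can be computed directly from the definition. The sketch below is one possible approach, not a method taken from the literature cited here: it finds the biconnected components of the underlying undirected graph with a standard depth-first search and counts, inside each component, every node with d incoming component edges d - 1 times. It assumes a network without parallel edges; the two example networks and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def blocks(edges):
    """Edge lists of the biconnected components of the underlying undirected graph."""
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    depth, low, stack, found = {}, {}, [], []

    def dfs(u, parent_edge):
        low[u] = depth[u]
        for w, i in adj[u]:
            if i == parent_edge:
                continue
            if w not in depth:                 # tree edge
                stack.append(i)
                depth[w] = depth[u] + 1
                dfs(w, i)
                low[u] = min(low[u], low[w])
                if low[w] >= depth[u]:         # u separates the subtree below w
                    block = []
                    while True:
                        j = stack.pop()
                        block.append(edges[j])
                        if j == i:
                            break
                    found.append(block)
            elif depth[w] < depth[u]:          # back edge to an ancestor
                stack.append(i)
                low[u] = min(low[u], depth[w])

    for u in list(adj):
        if u not in depth:
            depth[u] = 0
            dfs(u, None)
    return found

def level(edges):
    """Maximum number of reticulations in any biconnected component."""
    best = 0
    for block in blocks(edges):
        indegree = Counter(head for _, head in block)
        best = max(best, sum(d - 1 for d in indegree.values()))
    return best

# Hypothetical network with two separate cycles: two reticulations, level 1.
TWO_GALLS = [
    ("r", "a"), ("r", "b"),
    ("a", "c"), ("a", "d"), ("c", "h1"), ("d", "h1"),
    ("c", "x1"), ("d", "x2"), ("h1", "x3"),
    ("b", "e"), ("b", "f"), ("e", "h2"), ("f", "h2"),
    ("e", "x4"), ("f", "x5"), ("h2", "x6"),
]

# Hypothetical network whose two reticulations share one component: level 2.
ONE_BLOCK = [
    ("r", "a"), ("r", "b"), ("a", "h1"), ("a", "c"), ("b", "h1"), ("b", "d"),
    ("c", "x1"), ("c", "h2"), ("d", "h2"), ("d", "x2"), ("h1", "x3"), ("h2", "x4"),
]
```

Both example networks contain two reticulations, but only the second forces them into the same biconnected component, which is exactly the distinction the level captures. The recursive search is adequate for small illustrative networks only.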



This leads to the definition of the following MinLev variant of the fundamental problem. 
Given some set $\mathcal{D}$ of data describing some set X of taxa, find a level-k phylogenetic network that 
"represents" $\mathcal{D}$ such that k is as small as possible. There are again three versions: MinLevTrees, 
MinLevTriplets and MinLevClusters, for data $\mathcal{D}$ consisting of trees, triplets and clusters 
respectively. 

The definition of "represents" heavily depends on the nature of the data in $\mathcal{D}$. We will discuss 
four types of data: trees, triplets, clusters and binary characters. Throughout the paper we assume 
a fixed set X of taxa. 



1.1 Trees 



A phylogenetic tree on X is a phylogenetic network on X without reticulations. There exist numerous 
methods that construct phylogenetic trees, for example from DNA data. These methods 
include Maximum Likelihood, Maximum Parsimony, Bayesian- and distance-based methods like 
Neighbor Joining. When phylogenetic trees are constructed for several parts of the genome sep- 
arately (e.g. several genes), one often obtains a number of different phylogenetic trees. The same 
can occur when several phylogenetic trees are constructed using different methods. 




Figure 2. A phylogenetic tree T (a) and a phylogenetic network N (b,c,d); (b) illustrates in red 
that N displays T (edges not in the subdivision are dashed); (c) illustrates in blue that N is consistent 
with the triplet cd|f from T (edges not in the embedding are again dashed); (d) illustrates 
in green that N represents cluster {c, d, e} from T in the softwired sense (dashed reticulation edges 
are "switched off"). 

Thus, given a number of phylogenetic trees, it is interesting to find a phylogenetic network that 
"represents" each of them. This is formalized by the notion of "display" as follows. A phylogenetic 
tree T is displayed by a phylogenetic network N if T can be obtained from some subtree of N by 
suppressing nodes with indegree one and outdegree one (i.e. if some subtree of N is a subdivision 
of T). See Figure 2 for an example. 

For a set $\mathcal{T}$ of phylogenetic trees on X, we define: 

• $r_t(\mathcal{T})$ as the minimum number of reticulations in any phylogenetic network on X that displays 
each tree in $\mathcal{T}$ and 

• $\ell_t(\mathcal{T})$ as the minimum k such that there exists a level-k phylogenetic network on X that 
displays each tree in $\mathcal{T}$. 

The computation of $r_t$ has received much attention in the literature. For two binary trees on 
the same taxon set the problem is NP-hard and APX-hard [4], although on the positive side it 
is fixed-parameter tractable. [H] offers a good overview of these and related results. 

These algorithmic insights have been translated into the software HybridNumber [2] and its 
more advanced successor HybridInterleave. These programs compute $r_t$ exactly for two 
binary trees on the same taxon set. The program SPRDist solves the same problem (using 
integer linear programming) and the program PIRN [35] can compute lower and upper bounds 
on $r_t$ for any number of binary trees on the same taxon set. In [15] a polynomial-time algorithm is 
described that constructs a level-1 phylogenetic network that displays all trees and has a minimum 
number of reticulations, if such a network exists. 



1.2 Triplets 

A (rooted) triplet on X is a binary phylogenetic tree on a size-3 subset of X. We use xy|z to 
denote the triplet with taxa x, y on one side of the root and z on the other side of the root. 
Triplets can be constructed using any of the methods for constructing phylogenetic trees (using 
a fourth taxon as an outgroup in order to root the triplet). Alternatively, one can first construct 
one or more phylogenetic trees and subsequently find the set of triplets that are contained in these 
trees. The main motivation for the latter approach is that representing all triplets might require 
fewer reticulations than representing the entire trees. 

This can be formalised by using the notion of display introduced above. For triplets, often 
"consistent with" is used instead of "displayed by". A triplet xy|z is consistent with a phylogenetic 
network N (and N is consistent with xy|z) if xy|z is displayed by N. See Figure 2 for an example. 
Given a phylogenetic tree T on X, we let $Tr(T)$ denote the set of all rooted triplets on X that 
are consistent with T. For a set $\mathcal{T}$ of phylogenetic trees, we let $Tr(\mathcal{T})$ denote the set of all rooted 
triplets that are consistent with some tree in $\mathcal{T}$, i.e. $Tr(\mathcal{T}) = \bigcup_{T \in \mathcal{T}} Tr(T)$. 
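As an illustration of the definition, the triplets of a single tree can be enumerated by comparing lowest common ancestors: xy|z is consistent with the tree exactly when the lowest common ancestor of x and y lies strictly below that of x and z. The following is a minimal sketch, assuming trees are written as nested tuples of single-character leaf labels; this encoding and the function names are hypothetical.

```python
from itertools import combinations

def root_paths(tree):
    """Map each leaf to the tuple of child indices on its path from the root."""
    paths = {}

    def walk(node, path):
        if isinstance(node, str):
            paths[node] = path
        else:
            for i, child in enumerate(node):
                walk(child, path + (i,))

    walk(tree, ())
    return paths

def lca_depth(p, q):
    """Depth of the lowest common ancestor of two leaves, given their root paths."""
    d = 0
    while d < min(len(p), len(q)) and p[d] == q[d]:
        d += 1
    return d

def triplets(tree):
    """Return Tr(T) as strings 'xy|z' with x < y."""
    paths = root_paths(tree)
    result = set()
    for trio in combinations(sorted(paths), 3):
        for z in trio:
            x, y = [t for t in trio if t != z]
            inner = lca_depth(paths[x], paths[y])
            outer = max(lca_depth(paths[x], paths[z]), lca_depth(paths[y], paths[z]))
            if inner > outer:
                result.add(f"{x}{y}|{z}")
    return result

EXAMPLE_TREE = (("a", "b"), ("c", "d"))
```

For a binary tree every three taxa yield exactly one triplet, so the example tree on four taxa has four triplets. For a set of trees, the union of the individual results gives the triplet set of the whole set.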

For a set $\mathcal{R}$ of triplets on X, we define: 

• $r_{tr}(\mathcal{R})$ as the minimum number of reticulations in any phylogenetic network on X that is 
consistent with each triplet in $\mathcal{R}$ and 

• $\ell_{tr}(\mathcal{R})$ as the minimum k such that there exists a level-k phylogenetic network on X that is 
consistent with each triplet in $\mathcal{R}$. 

Throughout the article we will write $r_{tr}(\mathcal{T})$ and $\ell_{tr}(\mathcal{T})$ as abbreviations for $r_{tr}(Tr(\mathcal{T}))$ and 
$\ell_{tr}(Tr(\mathcal{T}))$ respectively. 

A triplet set $\mathcal{R}$ on X is said to be dense when, for every three distinct taxa $x, y, z \in X$, at least 
one of xy|z, xz|y, yz|x is in $\mathcal{R}$ [TB]. Given a dense triplet set, [I1][I7] describe a polynomial-time 
algorithm that constructs a level-1 network displaying all triplets, if such a network exists. The 
algorithm of [30] can be used to find such a network that also minimizes the number of reticulations, 
and this is available as the program Marlon. These results have later been extended to 
level-2 [27] [30] (see also the program Level2 [26]) and more recently to level-k, for all $k \in \mathbb{N}$ [25]. 
The program Simplistic [H][30] can be used to construct (simple) networks of arbitrary level 
(again, assuming density). 



1.3 Clusters 

A cluster on X is a proper subset of X. Clusters can be obtained from morphological data (e.g. 
species with wings, species with eight legs, etc.) or from phylogenetic trees. The latter approach has 
a similar motivation as in triplet methods. The clusters from the trees might be representable using 
fewer reticulations than would be necessary to represent the trees themselves. In addition, 
the clusters described by a phylogenetic tree are biologically the most interesting features of the 
tree, because they describe putative monophyletic groups of species (clades). 

We use $Cl(T)$ to denote the set of clusters of a phylogenetic tree T, i.e. for each edge (u, v) 
of T, the set $Cl(T)$ contains a cluster consisting of those taxa that are reachable by a directed 
path from v. For a set $\mathcal{T}$ of phylogenetic trees, we define $Cl(\mathcal{T}) = \bigcup_{T \in \mathcal{T}} Cl(T)$. 
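The cluster set of a tree can be read off with one traversal, since every non-root node corresponds to exactly one edge. The following is a minimal sketch, assuming trees are written as nested tuples of single-character leaf labels; the encoding and the function name are hypothetical.

```python
def clusters(tree):
    """Return Cl(T): one cluster (a frozenset of taxa) per edge of the tree."""
    found = set()

    def leaves_below(node, is_root):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(leaves_below(child, False) for child in node))
        if not is_root:          # the root has no incoming edge
            found.add(below)
        return below

    leaves_below(tree, True)
    return found

EXAMPLE_TREE = ((("a", "b"), "c"), ("d", "e"))
```

The example tree is binary on five taxa and therefore has eight edges and eight clusters, including the five singletons.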

Similar to tree- and triplet methods, the general aim of cluster methods is to construct a 
phylogenetic network that "represents" some set of input clusters. There are two different notions 
of "representing" for clusters: the "hardwired" and the "softwired" sense. Given a cluster $C \subset X$ 
and a phylogenetic network N on X, we say that N represents C in the hardwired sense if there 
exists an edge (u, v) in N such that C is the set of taxa reachable from v by a directed path [TS]. 

The definition of "representing" in the "softwired sense" is longer but biologically more relevant. 
We say that N represents C in the softwired sense if there exists an edge (u, v) in N such that C 
is the set of taxa reachable from v by a directed path, when for each reticulation r exactly one of its 
incoming edges is "switched on" and all other edges entering r are "switched off" (see Figure 2). 
As a direct consequence, C is represented by N in the softwired sense if and only if there exists 
a phylogenetic tree T on X that is displayed by N and has $C \in Cl(T)$. In this article, we do not 
consider cluster representation in the hardwired sense and therefore often write "represents" as 
short for "represents in the softwired sense". 
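The softwired definition translates into a brute-force enumeration over all ways of switching reticulation edges on and off. The following is a minimal sketch, assuming the network is given as a list of directed edges and that leaves carry their taxon as node name; it is exponential in the number of reticulations and is meant only to make the definition concrete. The function name and the example network are hypothetical.

```python
from collections import defaultdict
from itertools import product

def softwired_clusters(edges, taxa):
    """Return all clusters that the network represents in the softwired sense."""
    taxa = frozenset(taxa)
    parents = defaultdict(list)
    for u, v in edges:
        parents[v].append(u)
    reticulations = [v for v, ps in parents.items() if len(ps) >= 2]
    represented = set()
    # Try every switching: exactly one incoming edge per reticulation stays on.
    for choice in product(*(parents[r] for r in reticulations)):
        kept_parent = dict(zip(reticulations, choice))
        children = defaultdict(list)
        for u, v in edges:
            if kept_parent.get(v, u) == u:
                children[u].append(v)

        def below(node):
            if node in taxa:
                return frozenset([node])
            return frozenset().union(*(below(c) for c in children[node]))

        for v in parents:                      # every node with an incoming edge
            cluster = below(v)
            if cluster and cluster < taxa:     # non-empty proper subsets only
                represented.add(cluster)
    return represented

# Hypothetical network: taxon b hangs below a reticulation h with parents p and q.
EXAMPLE_EDGES = [
    ("r", "p"), ("r", "q"), ("p", "a"), ("p", "h"),
    ("q", "h"), ("q", "c"), ("h", "b"),
]
```

In the example, switching on the edge (p, h) yields the cluster {a, b} and switching on (q, h) yields {b, c}, whereas {a, c} is not represented under either switching.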



For a set of clusters $\mathcal{C}$ on X, we define: 

• $r_c(\mathcal{C})$ as the minimum number of reticulations in any phylogenetic network on X that represents 
all clusters in $\mathcal{C}$ in the softwired sense and 

• $\ell_c(\mathcal{C})$ as the minimum k such that there exists a level-k phylogenetic network on X that 
represents all clusters in $\mathcal{C}$ in the softwired sense. 

We write $r_c(\mathcal{T})$ as shorthand for $r_c(Cl(\mathcal{T}))$ and $\ell_c(\mathcal{T})$ as shorthand for $\ell_c(Cl(\mathcal{T}))$. 

A network is a galled network if it contains no path between two reticulations that is contained 
in a single biconnected component. In [11] and [14] an algorithm is described for constructing a 
galled network representing $\mathcal{C}$ in the softwired sense. In [33] the algorithm Cass [32] is presented, 
which aims at constructing a low-level network that represents $\mathcal{C}$. Cass always returns a network 
representing all input clusters and, when $\ell_c(\mathcal{C}) \le 2$, it is guaranteed to compute $\ell_c$ exactly. 
Alongside the algorithms from [I1][II][I3], Cass is available as part of the program Dendroscope. 



1.4 Binary character data 

Within the field of population genomics the literature on phylogenetic networks has evolved along 
a slightly different route to the literature on trees, triplets and clusters. At the level of populations 
the principal reticulation event is recombination, and in this context phylogenetic networks are 
sometimes called recombination networks. To avoid repetition we refer to (TU] [5] [H] for background 
and definitions. In this article we will always assume that recombination networks are constructed 
from binary character data and that the root sequence is the all-0 sequence i.e. we are dealing 
with the "root known" variant of the problem. We assume thus that the input is a binary n x m 
matrix M. 

The basic definition given in TIT is for the unrestricted multiple crossover variant of the recom- 
bination network model. Stated informally this means that, at each reticulation, each character 
can freely "choose" from which of its parents it inherits its value. This is quite different to the 
single crossover variant which has received far more attention in the literature. In the single cros- 
sover variant the sequence at a reticulation is forced to obtain a prefix from one of its parents, 
and a suffix from the other, thus modelling chromosomal crossover. 

For a binary matrix M, we define: 

• $r_{sc}(M)$ as the minimum number of reticulations required by a recombination network that 
represents M, assuming the single crossover variant and an all-0 root, and 

• $r_{uc}(M)$ as the minimum number of reticulations required by a recombination network that 
represents M, assuming the unrestricted multiple crossover variant and an all-0 root. 

Given that the latter is a relaxation of the former, it is immediately clear that for any input M, 

\[ r_{uc}(M) \le r_{sc}(M). \qquad (1) \]

In [31] it was claimed that it is NP-hard to compute rue However, [1] subsequently discovered 
that the proof in |34j was partially incorrect and modified it to prove that computation of r^c is 
NP-hard. 

There are some definitional subtleties when trying to map between the model of fTU| and the 
other models summarised in this article. Some differences between the models are rather arbitrary 
and minor and thus easy to overcome, and we do not discuss them here. In this article we restrict 
ourselves to a more fundamental comparison concerning (under an appropriate transformation) the 
values $r_{sc}$, $r_{uc}$ and $r_c$. 

The problem of computing $r_{sc}$ (in defiance of its NP-hardness) has attracted much attention. 
Articles such as [10] [8] [36] [24] [1^ give a good overview of the methods in use. Much energy has 
been invested in computing lower bounds for $r_{sc}$ (e.g. the program HapBound [24]), and some 
lower bounding techniques also produce valid lower bounds for $r_{uc}$ (e.g. [10]). Programs such as 
Shrub [53] produce upper bounds on $r_{sc}$, and Beagle [11] uses integer linear programming to 
compute $r_{sc}$ exactly (for small instances). The programs HapBound-GC and Shrub-GC compute 
lower and upper bounds on a value that lies somewhere between $r_{sc}$ and $r_{uc}$ [23]. As in 
other areas of the phylogenetic network literature the problem of computing $r_{sc}$ in a topologically 
constrained space of networks [5] has also been considered. 



1.5 Summary of Results 

In this article, we study how several methods for constructing phylogenetic networks are related. 
We begin by clarifying the relationship between phylogenetic networks that represent clusters in the 
softwired sense and recombination networks that represent binary character data. We explain that 
the two models are equivalent when unrestricted multiple crossover recombination is considered but 
fundamentally different when single crossover recombination is used. This clarification is necessary 
to place the main results from this article in the correct context. 

We then turn to the problem of constructing phylogenetic networks from trees, triplets or 
clusters. In particular, we focus on triplets and clusters obtained from a set of trees on the same 
set of taxa. We show that the number of reticulations required to display the triplets is always 
less than or equal to the number of reticulations necessary to represent all clusters, and the latter 
number is in turn less than or equal to the number of reticulations necessary to display the trees 
themselves: 

\[ r_{tr}(\mathcal{T}) \le r_c(\mathcal{T}) \le r_t(\mathcal{T}) . \]

We give examples for which these inequalities are strict i.e. an example in which the triplets 
need strictly fewer reticulations than the clusters and an example in which the clusters need strictly 
fewer reticulations than the trees. 

However, the main result of this article shows that, when one considers a set T containing 
two binary trees on the same set of taxa, the numbers of reticulations required to represent the 
triplets, clusters or the trees themselves are all equal: 

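\[ r_{tr}(\mathcal{T}) = r_c(\mathcal{T}) = r_t(\mathcal{T}) . \]

In addition, all the results above also hold for minimizing level. In particular: 

\[ \ell_{tr}(\mathcal{T}) = \ell_c(\mathcal{T}) = \ell_t(\mathcal{T}) . \]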

These unification results turn out to have important consequences. We use the equalities above 
to settle several complexity questions that have been open for some time and to strengthen several 
existing complexity results. In particular, we show that computation of $\ell_t(\mathcal{T})$, $r_c(\mathcal{T})$, $\ell_c(\mathcal{T})$, 
$r_{tr}(\mathcal{T})$ and $\ell_{tr}(\mathcal{T})$ are all NP-hard and APX-hard even when $\mathcal{T}$ consists of two binary trees on 
the same set of taxa. Thus, problems MinRetTriplets, MinRetClusters, MinLevTrees, 
MinLevTriplets and MinLevClusters are all NP-hard and APX-hard. 



2 Spot the difference 



2.1 Clusters and binary character data 

We say that two clusters $C_1, C_2 \subset X$ are compatible if either $C_1 \cap C_2 = \emptyset$ or $C_1 \subseteq C_2$ or $C_2 \subseteq C_1$, 
and incompatible otherwise. 

Let $\mathcal{C}$ be a set of clusters on X. Let $X = \{x_1, \ldots, x_n\}$ and $\mathcal{C} = \{c_1, \ldots, c_m\}$, i.e. impose an 
arbitrary ordering on X and $\mathcal{C}$. The matrix encoding of $\mathcal{C}$ is a binary matrix $Mat(\mathcal{C})$ with n rows 
and m columns. $Mat(\mathcal{C})_{i,j}$ has the value 1 if and only if $c_j$ contains taxon $x_i$. It is also natural 
to define the "dual" encoding. Given an n x m binary matrix M, the cluster encoding of M is a 
cluster set $Clus(M)$ containing a set of m clusters $\{c_1, \ldots, c_m\}$ on taxon set $\{x_1, \ldots, x_n\}$ such that 
$c_j$ contains $x_i$ if and only if $M_{i,j}$ has value 1. Clearly both encodings can be produced in polynomial 
time. 
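Both encodings amount to a single pass over the input. The following is a minimal sketch, assuming taxa are given as an ordered list and clusters as an ordered list of sets; the function names `mat` and `clus` simply mirror the notation above and are not taken from any existing software.

```python
def mat(cluster_list, taxa):
    """Matrix encoding: entry [i][j] is 1 iff cluster j contains taxon i."""
    return [[1 if x in c else 0 for c in cluster_list] for x in taxa]

def clus(matrix, taxa):
    """Cluster encoding: cluster j holds the taxa whose row has a 1 in column j."""
    n_columns = len(matrix[0]) if matrix else 0
    return [
        frozenset(x for x, row in zip(taxa, matrix) if row[j] == 1)
        for j in range(n_columns)
    ]

TAXA = ["a", "b", "c"]
CLUSTERS = [frozenset("ab"), frozenset("bc"), frozenset("c")]
```

Applying `clus` to the output of `mat` returns the original cluster list, which reflects the sense in which the two encodings are dual to each other.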

The following result was presented in [7] and is to some extent implicit in [TS' (and thus should 
be attributed to these two groups of authors), although to the best of our knowledge it has never 
been formally written down. It shows that in a very strong sense the construction of phylogenetic 
networks from clusters, and recombination networks from binary characters under the all-0 root, 
unrestricted multiple crossover variant, are equivalent. 

Observation 1 Given a cluster set C, any phylogenetic network N that represents C can be re- 
labelled (after possibly a trivial modification) to obtain a recombination network that represents 
Mat{C) under the unrestricted multiple crossover variant with all-0 root. Given a binary matrix 
M , any recombination network that represents M under the unrestricted multiple crossover vari- 
ant with all-0 root can be relabelled (after a possibly trivial modification) to obtain a phylogenetic 
network that represents Clus{M). 

Proof. The core idea is that the edges which represent clusters will become the edges upon which 
mutations from 0 to 1 will occur, and vice-versa. We will now formalise this. 

Consider first a cluster set $\mathcal{C} = \{c_1, \ldots, c_m\}$ and a phylogenetic network N that represents it. 
If necessary we first modify N slightly to ensure that every reticulation has outdegree exactly 1. 
Now, for each cluster $c_j \in \mathcal{C}$ there exists some tree $T_j$ on X that is displayed by N and which 
represents $c_j$. To obtain the recombination network for $Mat(\mathcal{C})$ we relabel as follows: the root of N 
receives the all-0 sequence and for each $c_j$ ($1 \le j \le m$) we locate the edge $e_j$ in $T_j$ that represents 
$c_j$, and fix some subdivision of $T_j$ in N. The edge $e_j$ will thus correspond to a directed path of 
edges in N; we arbitrarily choose one edge from this path as the edge at which character j mutates 
from 0 to 1. (We can assume without loss of generality that this is not a reticulation edge.) For 
each node v in N we say that character j has value 1 if and only if v lies in the subdivision of 
$T_j$ that we fixed and the node v' in $T_j$ to which it corresponds is reachable in $T_j$ from $e_j$ by a 
directed path. In particular, each character at a reticulation v inherits its value from the node 
immediately preceding v in the subdivision. 

Given an n x m binary matrix M and a recombination network N that represents it under 
the unrestricted multiple crossover variant with all-0 root, we first ensure that reticulations in N 
with outdegree 0 are modified to have outdegree exactly 1. Now, we can relabel N as follows. The 
leaf labelled with row i of M is mapped to taxon $x_i$ of X. Now, recall that the jth column of M 
corresponds to cluster $c_j \in Clus(M)$. Consider any such j. At every node v in N it is either (i) 
unambiguous from which parent of v the value of character j was inherited, or (ii) it is ambiguous, 
in which case we can arbitrarily choose any such parent, or (iii) character j mutates from 0 to 1 
on one of the edges feeding into v, in which case we choose that edge. This induces a tree which will 
be a subdivision of some tree $T_j$ on X. Furthermore, $T_j$ represents $c_j$, and we are done. □ 



Corollary 1. Given a cluster set $\mathcal{C}$, $r_c(\mathcal{C}) = r_{uc}(Mat(\mathcal{C}))$. Given a binary matrix M, 
$r_{uc}(M) = r_c(Clus(M))$. 

It is natural to wonder whether the single crossover variant is genuinely more restrictive than the 
unrestricted multiple crossover variant. Could it be, for example, that the columns of an input 
matrix M can always be re-ordered to obtain a matrix M' such that $r_{sc}(M') = r_{uc}(M)$? This is 
not so, as the following simple example shows. We observe firstly that for a cluster set $\mathcal{C}$ on a set 
of taxa X, $r_c(\mathcal{C}) \le |X| - 1$. This follows because we can use the construction depicted in Figure 3. 
Now, for any integer $p \ge 5$ we let $\mathcal{C}_p$ be the set of all clusters that contain exactly $\lfloor p/2 + 1 \rfloor$ 
elements of X, where X is a taxon set on p elements. Let $M = Mat(\mathcal{C}_p)$. It follows by Observation 1 
that $r_{uc}(M) = r_c(Clus(M)) = r_c(\mathcal{C}_p) \le p - 1$. 




Figure 3. A network that is consistent with all $3\binom{n}{3}$ triplets and represents all $2^n - 1$ clusters on 
taxon set $X = \{x_1, \ldots, x_n\}$. 

Clearly M has $k = \binom{p}{\lfloor p/2 + 1 \rfloor}$ columns and k grows exponentially in p. Let M' be obtained from 
M by arbitrarily permuting its columns. Note that any adjacent pair of columns in M' fails the 
three-gamete test (with respect to the all-0 root) because two distinct clusters containing $\lfloor p/2 + 1 \rfloor$ 
elements are necessarily incompatible. Hence, if we partition the columns of M' into $\lfloor k/2 \rfloor$ disjoint 
pairs of adjacent columns, and apply a composite haplotype bound (i.e. apply the haplotype bound 
independently to each disjoint pair of columns) [24][2n], it follows that $r_{sc}(M') \ge \lfloor k/2 \rfloor$. This lower 
bound grows exponentially in p, independently of the exact column permutation applied, while the 
upper bound on $r_{uc}(M)$ grows only linearly. For $p \ge 5$ the gap between these bounds is already 
greater than zero. 
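The arithmetic behind this example is easy to reproduce. The following is a minimal sketch that computes the two bounds for a given p and checks by brute force that all clusters of the chosen size are pairwise incompatible; the function names are hypothetical, and the code only evaluates the bounds stated above rather than computing any reticulation number.

```python
from itertools import combinations
from math import comb

def incompatible(c1, c2):
    """Two clusters are incompatible if they overlap without being nested."""
    return bool(c1 & c2) and not c1 <= c2 and not c2 <= c1

def bounds(p):
    """Return (lower bound on single crossover, upper bound on unrestricted) for p taxa."""
    size = p // 2 + 1                # equals floor(p/2 + 1) for integer p
    k = comb(p, size)                # number of clusters, i.e. columns of M
    return k // 2, p - 1

def all_pairs_incompatible(p):
    """Check that any two distinct clusters of size floor(p/2 + 1) are incompatible."""
    size = p // 2 + 1
    cluster_list = [frozenset(c) for c in combinations(range(p), size)]
    return all(incompatible(a, b) for a, b in combinations(cluster_list, 2))
```

For p = 4 the two bounds are 2 and 3, so they do not yet separate the models; for p = 5 they are 5 and 4, and for p = 10 they are 105 and 9.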

We remark in passing that the "root unknown" version of the unrestricted multiple crossover 
variant (let us denote this $r^*_{uc}$) has an interesting interpretation when given $Mat(\mathcal{C})$ as input. In 
the "root unknown" version characters are allowed to start with value 1 at the root and mutate at 
most once to 0 (as opposed to always starting with value 0 at the root and mutating at most once 
to 1). It follows then that $r^*_{uc}(Mat(\mathcal{C}))$ is the minimum number of reticulations ranging over all 
networks that, for each cluster $c \in \mathcal{C}$, represent c or the complementary cluster $X \setminus c$. It is easy 
to see that $r^*_{uc}(Mat(\mathcal{C}))$ can be significantly smaller than $r_{uc}(Mat(\mathcal{C}))$. For example, consider the 
set $\mathcal{C}$ of all size-2 clusters on a size-3 taxon set X. These clusters are mutually incompatible, so 
$r_{uc}(Mat(\mathcal{C})) \ge 1$. However, the complement of each cluster is a singleton cluster, so (by choosing 
the all-1 root) $r^*_{uc}(Mat(\mathcal{C})) = 0$. 
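The size-3 example can be verified mechanically. The following is a minimal sketch that builds the three size-2 clusters and their complements and tests pairwise compatibility; the names are hypothetical. It relies on the standard fact, assumed here and not proved in this article, that a set of pairwise compatible clusters can be represented by a tree.

```python
from itertools import combinations

def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or nested."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def pairwise_compatible(cluster_list):
    """True if every pair of clusters in the list is compatible."""
    return all(compatible(a, b) for a, b in combinations(cluster_list, 2))

TAXA = frozenset("abc")
SIZE_TWO = [frozenset(pair) for pair in combinations(sorted(TAXA), 2)]
COMPLEMENTS = [TAXA - c for c in SIZE_TWO]
```

The size-2 clusters fail the compatibility test, while their complements are singletons and pass it, in line with the values at least 1 and 0 given above.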



2.2 Clusters and triplets coming from trees 

Let us take a closer look at sets of triplets or clusters that are obtained from a set $\mathcal{T}$ of phylogenetic 
trees on the same set of taxa. We will show that any phylogenetic network that represents $Cl(\mathcal{T})$ 
is consistent with $Tr(\mathcal{T})$. It follows that representing all triplets requires at most as many reticulations 
as representing all clusters. Moreover, quite obviously, representing all clusters requires at 
most as many reticulations as representing the trees themselves. Thus, 

\[ r_{tr}(\mathcal{T}) \le r_c(\mathcal{T}) \le r_t(\mathcal{T}) . \qquad (2) \]

Furthermore, this is true not only with respect to minimizing the number of reticulations, but 
with respect to minimizing any property of the networks, e.g. level: 

\[ \ell_{tr}(\mathcal{T}) \le \ell_c(\mathcal{T}) \le \ell_t(\mathcal{T}) . \qquad (3) \]

We will show that each of the inequalities in (2) and (3) is strict for some set of trees $\mathcal{T}$. 
First, in order to prove (2) and (3), we show an important relation between Tr(T) and Cl(T). 

Lemma 1. For any three taxa $x, y, z \in X$ it holds that $xy|z \in Tr(T)$ if and only if there exists a 
cluster $C \in Cl(T)$ with $x, y \in C$ and $z \notin C$. 

Proof. First suppose that there is a cluster $C \in Cl(T)$ such that $x, y \in C$ and $z \notin C$. Then the 
triplet xy|z is consistent with T and hence $xy|z \in Tr(T)$. 

Now suppose that $xy|z \in Tr(T)$. Then the triplet xy|z is displayed by T and hence there is a 
subtree T' of T such that xy|z can be obtained from T' by suppressing nodes with indegree one 
and outdegree one. This subtree T' contains exactly one node with indegree one and outdegree 
two. Let C be the set of taxa reachable from this node. Then, $x, y \in C$, $z \notin C$ and $C \in Cl(T)$. □ 
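Lemma 1 can be checked on a concrete tree by computing the triplets in two independent ways: once from the clusters, as the lemma prescribes, and once by restricting the tree to each set of three taxa. The following is a minimal sketch, assuming trees are written as nested tuples of single-character leaf labels; the encoding and all function names are hypothetical.

```python
from itertools import combinations

def leafsets(tree, out, is_root=True):
    """Collect Cl(T) into `out`: the leaf set below every non-root node."""
    if isinstance(tree, str):
        below = frozenset([tree])
    else:
        below = frozenset().union(*(leafsets(child, out, False) for child in tree))
    if not is_root:
        out.add(below)
    return below

def triplets_from_clusters(cluster_set, taxa):
    """xy|z whenever some cluster contains x and y but not z (as in Lemma 1)."""
    return {
        (frozenset([x, y]), z)
        for c in cluster_set
        for x, y in combinations(sorted(c), 2)
        for z in taxa - c
    }

def restrict(tree, keep):
    """Restrict the tree to the taxa in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def triplets_by_restriction(tree, taxa):
    """xy|z whenever the tree restricted to {x, y, z} groups x and y together."""
    found = set()
    for trio in combinations(sorted(taxa), 3):
        r = restrict(tree, set(trio))
        if len(r) == 2:                       # resolved: one pair and one single leaf
            pair, z = (r[0], r[1]) if isinstance(r[0], tuple) else (r[1], r[0])
            found.add((frozenset(pair), z))
    return found

EXAMPLE_TREE = ((("a", "b"), "c"), ("d", "e"))
EXAMPLE_TAXA = frozenset("abcde")
```

On the example tree both routes produce the same ten triplets, one for each choice of three taxa, as expected for a binary tree.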

It follows that, for any set $\mathcal{T}$ of trees on the same set X of taxa, $Cl(\mathcal{T})$ uniquely determines 
$Tr(\mathcal{T})$. 

We will now prove the following proposition, from which correctness of (2) and (3) follows. 

Proposition 1. For any set T of trees on the same set X of taxa, any phylogenetic network on X 
representing Cl{T) is consistent with Tr{T). 

Proof. Let N be a phylogenetic network on X representing $Cl(\mathcal{T})$. Consider a triplet $xy|z \in Tr(\mathcal{T})$. 
By Lemma 1 there is a cluster $C \in Cl(T)$ (for some $T \in \mathcal{T}$) with $x, y \in C$ and $z \notin C$. Cluster C 
is represented by N (in the softwired sense) and hence there exists a phylogenetic tree $T_C$ on X 
that is displayed by N and has $C \in Cl(T_C)$. Because $x, y \in C$ and $z \notin C$, it follows that xy|z is 
displayed by $T_C$. Since $T_C$ is displayed by N, it follows that xy|z is displayed by N. Hence, N is 
consistent with xy|z. □ 



Before proceeding further, the following two lemmas will be of use throughout the rest of the 
article. 

Lemma 2. Let N be a phylogenetic network on X . Then we can transform N into a binary 
phylogenetic network N' such that N' has the same reticulation number and level as N and if T 
is a binary tree displayed by N then T is also displayed by N' . 

Proof. The transformation is very simple (and can clearly be conducted in polynomial time, if 
necessary). To begin with, each reticulation v with outdegree 0 (which will necessarily be labelled 
with some taxon $x \in X$) is transformed into a reticulation with outdegree 1 as follows. We introduce 
a new node v', add the edge (v, v') and move label x to node v'. Next we deal with nodes v that 
have both indegree and outdegree greater than 1. Here we replace the node v by an edge $(v_1, v_2)$ 
such that the edges incoming to v now enter $v_1$, and the edges outgoing from v now exit from $v_2$. 
Subsequently nodes with indegree at most 1 and outdegree $d \ge 3$ can be replaced by a chain of 
(d - 1) nodes of indegree at most 1 and outdegree 2. Nodes with indegree $d \ge 3$ and outdegree 1 
can be replaced by a chain of (d - 1) nodes of indegree 2 and outdegree 1. The critical observation 
is that if a binary tree T is displayed by N then there is a subdivision of T in N which is also 
binary. This means that for each node v in N the subdivision uses at most two outgoing edges of 
v and at most one incoming edge of v. Hence the subdivision can easily be extended to become a 
subdivision within N'. □ 

Lemma 3. Let N be a phylogenetic network on X and T a set of binary trees on X . Then there 
exists a binary phylogenetic network N' on X such that (a) N' has the same reticulation number 
and level as N, (b) if N displays all trees in T then so too does N' , (c) if N is consistent with 
Tr{T) then so too is N' and (d) if N represents Cl{T) then so too does N' . 

Proof. (a) and (b) are immediate from Lemma 2. For (c) note that for each triplet $t \in Tr(\mathcal{T})$ there 
is some subdivision of t in N. A triplet t is binary, and thus so too is any subdivision of t, so we 
can apply the same argument as used in Lemma 2. For (d), note that for each cluster $c \in Cl(\mathcal{T})$ 
there is some tree T on X which is displayed by N and which represents c. T is perhaps not binary, 
and thus a subdivision of it in N is perhaps also not binary, so after the transformation described 
in Lemma 2 this subdivision will have become the subdivision of some binary tree T'. However, 
T' is a refinement of T, i.e. $Cl(T) \subseteq Cl(T')$, so c is also represented by N'. □ 

We will now show that each of the inequalities in (2) and (3) is strict for some set of trees. To 
do so for the first inequality in each formula, consider the set $\mathcal{T}$ of three trees, and the network N, 
shown in Figure 4. It is easy to check that N is consistent with all the triplets in $Tr(\mathcal{T})$. However, 
any network that represents Cl{T) requires at least 3 reticulations, and will be level-3 or higher, 
as can be verified by a straightforward (but technical) case analysis or by using the program Cass 
[55] . Specifically: if a level-1 or level-2 network existed that represented Cl{T) then Cass would 
definitely find it, and it does not. 

Figure 5 shows a set $\mathcal{T}$ of trees for which the second inequality in (2) and (3) is strict. A level-1 
network with one reticulation is shown that represents all clusters from the three trees. However, 
a network with k reticulations can display at most $2^k$ distinct trees, so any network that displays 
all three trees will require at least two reticulations. It will also have level at least 2, because 
a (without loss of generality) binary level-1 network displaying all three trees would have two 
nontrivial biconnected components, and thus all three trees would have a common non-singleton 
cluster, but this is not so. 

Although we do not present a proof, empirical experiments furthermore suggest that it is 
possible to "boost" the example given in Figure 5 to create sets of three binary trees $\mathcal{T}$ such that 
the gap between $r_t(\mathcal{T})$ and $r_c(\mathcal{T})$ can be made arbitrarily large [31]. 




Figure 4. The triplets obtained from the three trees on the left are consistent with the level-2 
network on the right containing two reticulations. However, any network representing all the 
clusters from these trees will have at least three reticulations and be level-3 or higher. 







Figure 5. The level-1 network on the right with a single reticulation represents the union of the
clusters (and triplets) obtained from the three trees on the left. However, any network that displays
all three trees will have at least two reticulations and have level at least two.

2.3 Clusters and triplets coming from two binary trees 

This section presents the main results of this paper. We will show that the number of reticulations 
necessary to represent the clusters from two binary trees on the same taxa is equal to the number 
of reticulations necessary to represent the trees themselves. In addition, we will show that also 
the number of reticulations necessary to represent all triplets from the two trees is equal to the 
number of reticulations necessary to represent the trees themselves. Moreover, we will show that 
the same is true when not the number of reticulations but the level of the networks is minimized. 
This means that for data coming from two binary trees on the same set of taxa, the tree-, cluster- 
and triplet problems all coincide. 

Let $\mathcal{T}$ be a set containing two binary phylogenetic trees on the same set of taxa. Recall
that $Cl(\mathcal{T})$ is the set of all clusters from both trees in $\mathcal{T}$ and $Tr(\mathcal{T})$ is
the set of all triplets from both trees. We start by showing that the minimum number of reticulations
in a network consistent with $Tr(\mathcal{T})$ is equal to the minimum number of reticulations in a
network displaying both trees in $\mathcal{T}$. The fact that the number of reticulations necessary to
represent $Cl(\mathcal{T})$ is also the same will be a corollary. After this corollary we will show that
the results also hold for level minimization.
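
To make the objects $Cl(\mathcal{T})$ and $Tr(\mathcal{T})$ concrete, the following sketch computes them for two small trees. It assumes, purely for illustration, that trees are encoded as nested tuples with string leaves, that a cluster is the leaf set below a non-root node, and that a triplet $xy|z$ is stored as a pair (frozenset of x and y, z); none of these encodings comes from the paper.

```python
from itertools import combinations


def leaves(tree):
    """Leaf set of a tree given as nested tuples with string leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every non-root node (one cluster per edge)."""
    found = set()

    def visit(node, is_root):
        if not is_root:
            found.add(leaves(node))
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return found


def triplets(tree):
    """All triplets xy|z of the tree, encoded as (frozenset({x, y}), z)."""
    taxa = leaves(tree)
    result = set()
    for cluster in clusters(tree):
        # x and y lie below some node that z does not lie below.
        for x, y in combinations(sorted(cluster), 2):
            for z in taxa - cluster:
                result.add((frozenset((x, y)), z))
    return result


t1 = ((("a", "b"), "c"), "d")
t2 = (("a", "c"), ("b", "d"))
cl_union = clusters(t1) | clusters(t2)   # Cl(T) for T = {t1, t2}
tr_union = triplets(t1) | triplets(t2)   # Tr(T) for T = {t1, t2}
print(len(cl_union), len(tr_union))
```

Each binary tree on four taxa contributes $2(n-1) = 6$ clusters and four triplets; the unions are smaller than the sums because the two trees share the singleton clusters and the triplet $ac|d$.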

First, however, some context is necessary. As mentioned earlier, [4] fixed the partially correct
result of [33] to prove that computation of $r_{sc}$ is NP-hard. The correct part of the proof in [33],
Claim 2, essentially showed that, for a set $\mathcal{T} = \{T_1, T_2\}$ of two binary trees on a set $X$
of taxa, $r_t(\mathcal{T}) \le r_{uc}(M^*)$, where $M^*$ is the concatenation of $Mat(Clus(T_1))$ and
$Mat(Clus(T_2))$ into a single matrix containing $4(n-1)$ columns (i.e. characters) and $|X| = n$ rows.
By (1) they thus also proved that $r_t(\mathcal{T}) \le r_{sc}(M^*)$, and this fact is used in [3].^1 Now,
observe that $Clus(M^*)$ is equal to $Cl(\mathcal{T})$. Hence, by Observation 1,
$r_t(\mathcal{T}) \le r_{uc}(M^*) = r_c(\mathcal{T})$. It is clear that $r_c(\mathcal{T}) \le r_t(\mathcal{T})$
and hence $r_t(\mathcal{T}) = r_c(\mathcal{T})$. In this sense the equivalence of $r_t(\mathcal{T})$ and
$r_c(\mathcal{T})$ for pairs of binary trees was already implicitly present in the literature. However,
given (a) the lack of clarity in the proof of [34], (b) the fact that Observation 1 has only been
implicitly present in the literature up until now and (c) the desire to produce a unification result
which also includes triplets, we have decided that it is useful to prove this two-tree result directly
and explicitly and to explore its consequences.

Theorem 1. If $\mathcal{T} = \{T_1, T_2\}$ consists of two binary phylogenetic trees on the same set of
taxa, then $r_{tr}(\mathcal{T}) = r_t(\mathcal{T})$.

Proof. To increase the clarity of the proof we write $r_t(T_1, T_2)$ as shorthand for $r_t(\{T_1, T_2\})$
and $r_{tr}(T_1, T_2)$ as shorthand for $r_{tr}(\{T_1, T_2\})$.

Clearly, $r_t(T_1, T_2) \ge r_{tr}(T_1, T_2)$, since any phylogenetic network displaying $T_1$ and $T_2$ is
consistent with all triplets from $T_1$ and $T_2$. It remains to show $r_t(T_1, T_2) \le r_{tr}(T_1, T_2)$.

Suppose this is not true. Let $n$ be the number of leaves in a smallest counter example, i.e. $n$
is the smallest number such that there exist two binary phylogenetic trees $T_1$ and $T_2$ on a set of
taxa $X$ with $|X| = n$ such that $r_t(T_1, T_2) > r_{tr}(T_1, T_2)$. Clearly $n \ge 3$. Let $N_t$ be a
phylogenetic network on $X$ with $r_t(T_1, T_2)$ reticulations that displays $T_1$ and $T_2$ and let
$N_{tr}$ be a phylogenetic network on $X$ with $r_{tr}(T_1, T_2)$ reticulations that is consistent with all
triplets in $T_1$ and $T_2$.

We may assume by Lemma 3 that $N_{tr}$ and $N_t$ are binary. We define a reticulation leaf as a
leaf whose parent is a reticulation and a cherry as two leaves with a common parent.

We first prove that any binary phylogenetic network contains either a reticulation leaf or a
cherry. Suppose that this is not true and let $N$ be a smallest counter example, i.e. $N$ has no
reticulation leaves and no cherries and has a minimum number of leaves over all such networks.
Take any leaf $x$ of $N$ and let $p$ be its parent. Since $N$ has no reticulation leaves, $p$ cannot be a
reticulation, so $p$ is either a split node or the root. In both cases, we delete $x$ and contract the
remaining edge leaving $p$, giving a smaller counter example. We conclude that any binary phylogenetic
network contains either a reticulation leaf or a cherry. Hence, this is also true for $N_{tr}$.
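
The two structures used in this case distinction are easy to detect mechanically. The sketch below is a minimal illustration, assuming a network is given as a list of directed edges (parent, child) without parallel edges; the function name and the example network are hypothetical.

```python
def cherries_and_reticulation_leaves(edges):
    """Return (cherries, reticulation_leaves) of a network given as (parent, child) edges.

    A cherry is returned as a frozenset of two leaves with a common parent;
    a reticulation leaf is a leaf whose parent has indegree at least two.
    """
    parents, children = {}, {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
        children.setdefault(u, set()).add(v)
    nodes = set(parents) | set(children)
    leaves = {n for n in nodes if n not in children}

    reticulation_leaves = {
        leaf for leaf in leaves
        if any(len(parents.get(p, ())) >= 2 for p in parents[leaf])
    }
    cherries = {
        frozenset(kids) for kids in children.values()
        if len(kids) == 2 and kids <= leaves
    }
    return cherries, reticulation_leaves


# A binary network on taxa a, b, x with a single reticulation r above x.
network = [("root", "u"), ("root", "v"), ("u", "a"), ("u", "r"),
           ("v", "r"), ("v", "b"), ("r", "x")]
# A binary tree on taxa a, b, c in which a and b form a cherry.
tree = [("root", "p"), ("root", "c"), ("p", "a"), ("p", "b")]

print(cherries_and_reticulation_leaves(network))
print(cherries_and_reticulation_leaves(tree))
```

In the example network there is no cherry but $x$ is a reticulation leaf, while the example tree has the cherry $\{a, b\}$ and no reticulation leaf, in line with the statement that at least one of the two structures is always present.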

First suppose that $N_{tr}$ contains a cherry. Let this cherry consist of leaves $a$, $b$ and their common
parent $v$. Then $\{a, b\}$ is a cluster of $T_1$ and of $T_2$, i.e. they both contain an edge whose set of
leaf descendants is exactly $\{a, b\}$. If this were not so, then at least one of $T_1$ and $T_2$ would be
consistent with a triplet $ac|b$ or $bc|a$ for some $c \notin \{a, b\}$, and such a triplet is not
consistent with $N_{tr}$. It follows that each of $T_1$ and $T_2$ contains a cherry with leaves $a$, $b$.
Let $T_1'$ and $T_2'$ be the trees obtained from $T_1$, $T_2$ respectively by deleting leaves $a$ and $b$
and labelling their common parent by a new label $ab$. Now, for a phylogenetic tree $T$ and a cluster
$C \in Cl(T)$, let $T|C$ denote the subtree of $T$ on taxon set $C$ and let $T^{C \to c}$ denote the
phylogenetic tree obtained from $T$ by replacing the subtree on $C$ by a new leaf $c$. Theorem 1 of
Baroni et al. [1] states that
$r_t(T_1, T_2) = r_t(T_1|C, T_2|C) + r_t(T_1^{C \to c}, T_2^{C \to c})$ whenever
$C \in Cl(T_1) \cap Cl(T_2)$. Hence, if we take $C = \{a, b\}$, then $T_1|C$ and $T_2|C$ are identical, so
$r_t(T_1|C, T_2|C) = 0$ and we have that $r_t(T_1', T_2') = r_t(T_1, T_2)$.

^1 The specific column ordering in $M^*$ (first the clusters from $T_1$ in arbitrary order, and then the
clusters from $T_2$ in arbitrary order) is important for establishing that
$r_t(\mathcal{T}) \le r_{sc}(M^*)$. In particular, it is easy to construct instances $\{T_1, T_2\}$ such
that a bad permutation of the columns of $M^*$ causes $r_{sc}(M^*)$ to be arbitrarily larger than
$r_t(\mathcal{T})$.

Furthermore, $r_{tr}(T_1', T_2') \le r_{tr}(T_1, T_2)$ because deleting $a$ and $b$ from $N_{tr}$ and
labelling $v$ by $ab$ leads to a phylogenetic network with $r_{tr}(T_1, T_2)$ reticulations that is
consistent with all triplets in $T_1'$ and $T_2'$. We conclude that

$r_t(T_1', T_2') = r_t(T_1, T_2) > r_{tr}(T_1, T_2) \ge r_{tr}(T_1', T_2')$.

Hence, we have constructed a smaller counter example, which is a contradiction.

Now suppose that $N_{tr}$ contains a reticulation leaf. Let $x$ be such a leaf and $r$ its parent.
Let $N_{tr} \setminus x$ be the result of removing $x$ and $r$ from $N_{tr}$. Let $N_t \setminus x$ be the
result of removing $x$ from $N_t$ and removing the former parent of $x$ as well if it is a reticulation.
Let $T_1 \setminus x$ and $T_2 \setminus x$ be the trees obtained from $T_1$ and $T_2$ respectively by
removing $x$ and contracting the remaining edge leaving the former parent of $x$. That is, do the
following for $i \in \{1, 2\}$. Let $p_i$ be the former parent of $x$ in $T_i$. If $p_i$ is not the root,
there is one edge $(u_i^T, p_i)$ entering $p_i$ and one edge $(p_i, v_i^T)$ leaving $p_i$. Remove $p_i$ and
replace the edges $(u_i^T, p_i)$, $(p_i, v_i^T)$ by a single edge $(u_i^T, v_i^T)$. We will use the edges
$(u_i^T, v_i^T)$ later on. If $p_i$ is the root, we remove $x$ and $p_i$ and leave $(u_i^T, v_i^T)$
undefined.

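First observe that $N_{tr} \setminus x$ is consistent with all triplets of $T_1 \setminus x$ and
$T_2 \setminus x$. Moreover, since $N_{tr} \setminus x$ contains one reticulation fewer than $N_{tr}$,

$r_{tr}(T_1 \setminus x, T_2 \setminus x) < r_{tr}(T_1, T_2) < r_t(T_1, T_2)$   (4)

and hence

$r_{tr}(T_1 \setminus x, T_2 \setminus x) \le r_t(T_1, T_2) - 2$.

Now observe that $N_t \setminus x$ displays $T_1 \setminus x$ and $T_2 \setminus x$. We will show that

$r_t(T_1 \setminus x, T_2 \setminus x) \ge r_t(T_1, T_2) - 1$.   (5)

Together, (4) and (5) imply that

$r_{tr}(T_1 \setminus x, T_2 \setminus x) \le r_t(T_1, T_2) - 2 \le r_t(T_1 \setminus x, T_2 \setminus x) - 1$

and hence that we have obtained a smaller counter example, which is a contradiction.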

It remains to prove (5). Let $N'$ be a phylogenetic network on $X \setminus \{x\}$ with
$r_t(T_1 \setminus x, T_2 \setminus x)$ reticulations that displays $T_1 \setminus x$ and $T_2 \setminus x$.
Since $T_1 \setminus x$ is displayed by $N'$, there exists a subgraph $E_1$ of $N'$ that is a subdivision
of $T_1 \setminus x$ (an embedding of $T_1 \setminus x$ into $N'$). Similarly, let $E_2$ be a subgraph
of $N'$ that is a subdivision of $T_2 \setminus x$. We will now use the edges $(u_1^T, v_1^T)$ and
$(u_2^T, v_2^T)$ that we introduced when defining $T_1 \setminus x$ and $T_2 \setminus x$. For
$i \in \{1, 2\}$, if the edge $(u_i^T, v_i^T)$ has been defined, we define the edge $e_i$ as follows. The
edge $(u_i^T, v_i^T)$ corresponds to a directed path in $E_i$. Let $e_i$ be any edge of this path. Notice
that $e_i$ is an edge of $N'$.

Let $N^{+}$ be the network obtained by subdividing $e_1$ and $e_2$ and making $x$ a reticulation leaf
below the new nodes. To be precise, for $i \in \{1, 2\}$, if $e_i = (u_i, v_i)$ has been defined, replace
$e_i$ by $(u_i, n_i)$, $(n_i, v_i)$ with $n_i$ a new node. If $(u_i^T, v_i^T)$ has not been defined, add a
new root $n_i$ and an edge from $n_i$ to the old root. Finally, add a leaf labelled $x$, a new
reticulation $r$ and edges $(n_1, r)$, $(n_2, r)$ and $(r, x)$.
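
The construction just described can be sketched as follows for the case in which both edges $e_1$ and $e_2$ are defined. The edge-list encoding and the node names n1, n2 and r are assumptions made for this illustration (they are taken to be unused in the input network), and the root case is deliberately left out.

```python
from collections import Counter


def attach_reticulation_leaf(edges, e1, e2, leaf="x"):
    """Subdivide e1 and e2 and hang `leaf` below a new reticulation "r".

    `edges` is a list of (parent, child) pairs; e1 and e2 must be edges of it.
    Only the case where both edges are defined is handled here.
    """
    new_edges = list(edges)
    attach_points = []
    for i, (u, v) in enumerate((e1, e2), start=1):
        n_i = f"n{i}"
        if (u, v) not in new_edges:
            # e2 coincides with e1, which was already split into (u, n1), (n1, v).
            u = "n1"
        new_edges.remove((u, v))
        new_edges.extend([(u, n_i), (n_i, v)])
        attach_points.append(n_i)
    new_edges.extend([(attach_points[0], "r"), (attach_points[1], "r"), ("r", leaf)])
    return new_edges


def count_reticulations(edges):
    """Number of nodes with indegree at least two."""
    indegree = Counter(child for _, child in edges)
    return sum(1 for degree in indegree.values() if degree >= 2)


n_prime = [("root", "a"), ("root", "b")]
n_plus = attach_reticulation_leaf(n_prime, ("root", "a"), ("root", "b"))
print(count_reticulations(n_prime), count_reticulations(n_plus))
```

The only node whose indegree changes is the new node r, so the result has exactly one reticulation more than the input, which is the property the proof relies on.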

Observe that $N^{+}$ displays $T_1$ and $T_2$, because we can simply extend each of the embeddings $E_1$
and $E_2$ by the new edges leading to the leaf $x$. Moreover, $N^{+}$ contains exactly one reticulation
more than $N'$. Thus, $r_t(T_1, T_2) \le r_t(T_1 \setminus x, T_2 \setminus x) + 1$, which remained to be
shown. □



Corollary 2. If $\mathcal{T}$ consists of two binary phylogenetic trees on the same set of taxa, then

$r_{tr}(\mathcal{T}) = r_c(\mathcal{T}) = r_t(\mathcal{T})$.



Proof. Follows from combining Theorem 1 with (2). □

Figure 6. The network on the right represents the union of the clusters (and triplets) obtained
from the two trees on the left, but it does not display both trees.

Given this result it is natural to ask whether every network that represents all the clusters (or
triplets) from two binary trees $T_1$ and $T_2$ on the same taxon set, and has a minimum number
of reticulations, also displays $T_1$ and $T_2$. This is not so. Consider the two trees in Figure 6. It
is easy to check that two reticulations are necessary and sufficient to display both these trees.
The network in this figure contains two reticulations and represents the union of the clusters (and
triplets) from both trees, but it does not display both trees.

We note that Theorem 1 and Corollary 2 do not hold for sets of three or more trees, as
demonstrated in Section 2.2 by Figure 5. In addition, they do not hold for two possibly
non-binary trees, as demonstrated by Figure 7.







Figure 7. The network on the right displays the two trees on the left: at least one reticulation 
is necessary. However, the tree on the left is sufficient to represent the union of the clusters (or 
triplets) obtained from both trees. 



For a binary phylogenetic network N on X the notion of a cut-edge is well-defined: an edge (u, v) 
whose removal disconnects N . A cut-edge is trivial if at least one of the disconnected components 
created by its removal contains fewer than 2 taxa from X , and is called nontrivial otherwise. N is 
said to be simple if it does not contain any nontrivial cut-edges. 
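
As an illustration of this definition, the following sketch lists the nontrivial cut-edges of a small network by brute force: each edge is removed in turn and the taxa on both sides are counted. The edge-list encoding and the function names are assumptions made for this example, and the quadratic running time is chosen for readability rather than efficiency.

```python
def nontrivial_cut_edges(edges, taxa):
    """Edges whose removal splits the network into two parts with >= 2 taxa each."""
    nodes = {n for edge in edges for n in edge}
    taxa = set(taxa)
    result = []
    for idx, (u, v) in enumerate(edges):
        # Undirected adjacency of the network without edge number idx.
        adjacency = {n: set() for n in nodes}
        for a, b in edges[:idx] + edges[idx + 1:]:
            adjacency[a].add(b)
            adjacency[b].add(a)
        # Collect the connected component that contains v.
        seen, stack = {v}, [v]
        while stack:
            for neighbour in adjacency[stack.pop()]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if u in seen:
            continue  # still connected, so (u, v) is not a cut-edge
        if len(taxa & seen) >= 2 and len(taxa - seen) >= 2:
            result.append((u, v))
    return result


def is_simple(edges, taxa):
    """A network is simple if it has no nontrivial cut-edges."""
    return not nontrivial_cut_edges(edges, taxa)


tree = [("root", "p"), ("root", "q"), ("p", "a"), ("p", "b"), ("q", "c"), ("q", "d")]
network = [("root", "u"), ("root", "v"), ("u", "a"), ("u", "r"),
           ("v", "r"), ("v", "b"), ("r", "x")]
print(nontrivial_cut_edges(tree, "abcd"), is_simple(network, "abx"))
```

In the four-taxon tree both edges leaving the root are nontrivial cut-edges, whereas the three-taxon network with one reticulation is simple.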

Theorem 2. If $\mathcal{T}$ consists of two binary phylogenetic trees on the same set of taxa, then

$\ell_{tr}(\mathcal{T}) = \ell_c(\mathcal{T}) = \ell_t(\mathcal{T})$.



Proof. By (3), it suffices to show $\ell_t(\mathcal{T}) \le \ell_{tr}(\mathcal{T})$. We do so by induction
on $|X|$. The base case for $|X| \le 2$ is clear. Now consider a set of trees $\mathcal{T}$ on $X$ with
$|X| = n$. Let $N_t$ be a network that displays all trees in $\mathcal{T}$ and has optimal level
$\ell_t(\mathcal{T})$. Similarly, let $N_{tr}$ be a network consistent with $Tr(\mathcal{T})$ that has
optimal level $\ell_{tr}(\mathcal{T})$. By Lemma 3 we may assume that $N_t$ and $N_{tr}$ are both
binary. We distinguish three cases.

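First suppose that neither $N_t$ nor $N_{tr}$ contains nontrivial cut-edges, i.e. that $N_t$ is a simple
level-$\ell_t(\mathcal{T})$ network and $N_{tr}$ is a simple level-$\ell_{tr}(\mathcal{T})$ network. In that
case, the number of reticulations in $N_t$ is equal to $\ell_t(\mathcal{T})$. So,
$r_t(\mathcal{T}) \le \ell_t(\mathcal{T})$. At the same time, $r_t(\mathcal{T}) \ge \ell_t(\mathcal{T})$,
since the number of reticulations in any network is at least equal to its level. Thus,
$r_t(\mathcal{T}) = \ell_t(\mathcal{T})$. Similarly, $r_{tr}(\mathcal{T}) = \ell_{tr}(\mathcal{T})$. Moreover,
by Theorem 1, $r_{tr}(\mathcal{T}) = r_t(\mathcal{T})$ and we can conclude that
$\ell_{tr}(\mathcal{T}) = r_{tr}(\mathcal{T}) = r_t(\mathcal{T}) = \ell_t(\mathcal{T})$.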

Now suppose that $N_t$ contains at least one nontrivial cut-edge and let $e$ be such an edge. Let $C$
be the set of taxa reachable from $e$ by a directed path. Let $\mathcal{T}|C$ be the set of trees obtained
by restricting each of the trees in $\mathcal{T}$ to the taxa in $C$ and let $\mathcal{T}^{C \to c}$ denote
the set of trees obtained by replacing, in each tree in $\mathcal{T}$, the subtree on $C$ by a single leaf
labelled $c$. We claim that

$\ell_t(\mathcal{T}) \le \max\{\ell_t(\mathcal{T}|C), \ell_t(\mathcal{T}^{C \to c})\}$
$= \max\{\ell_{tr}(\mathcal{T}|C), \ell_{tr}(\mathcal{T}^{C \to c})\}$
$\le \ell_{tr}(\mathcal{T})$.

To see that $\ell_t(\mathcal{T}) \le \max\{\ell_t(\mathcal{T}|C), \ell_t(\mathcal{T}^{C \to c})\}$, notice
that any network displaying $\mathcal{T}^{C \to c}$ can be combined with any network displaying
$\mathcal{T}|C$ in order to obtain a network displaying $\mathcal{T}$. This can be done by replacing the
leaf $c$ of the network displaying $\mathcal{T}^{C \to c}$ by the network displaying $\mathcal{T}|C$.
The network obtained in this way displays $\mathcal{T}$ and its level is equal to the maximum of the
levels of the networks displaying $\mathcal{T}^{C \to c}$ and $\mathcal{T}|C$. So,
$\ell_t(\mathcal{T}) \le \max\{\ell_t(\mathcal{T}|C), \ell_t(\mathcal{T}^{C \to c})\}$. Then we use
that $\ell_t(\mathcal{T}|C) = \ell_{tr}(\mathcal{T}|C)$ and
$\ell_t(\mathcal{T}^{C \to c}) = \ell_{tr}(\mathcal{T}^{C \to c})$ by induction. To prove the last
inequality, observe that $\ell_{tr}(\mathcal{T}|C) \le \ell_{tr}(\mathcal{T})$ because removing leaves
cannot increase the level. In addition, $\ell_{tr}(\mathcal{T}^{C \to c}) \le \ell_{tr}(\mathcal{T})$ because
$\mathcal{T}^{C \to c}$ can be constructed by removing all leaves in $C$ except for one, which is
relabelled $c$, and removing or relabelling leaves cannot increase the level.

The final case is that $N_{tr}$ contains a nontrivial cut-edge $e$. Let $C$ be the set of taxa that can be
reached from $e$ by a directed path in $N_{tr}$. Clearly, for $x, y \in C$ and $z \notin C$,
$xy|z \in Tr(\mathcal{T})$. Thus, $C$ is a cluster of each of the trees of $\mathcal{T}$. Therefore, we can
argue in the same way as in the previous case that $\ell_t(\mathcal{T}) \le \ell_{tr}(\mathcal{T})$. □



3 Complexity Consequences 

Theorem 1 and Corollary 2 allow us to settle several complexity questions in the phylogenetic
network literature that have been open for some time, and to significantly strengthen some
already existing hardness results.



Corollary 3. Computing rc(T) and computing rtr(T) are both NP-hard and APX-hard, even for 
sets T consisting of two binary trees on the same set of taxa. 



Proof. Follows from Corollary 2 and the fact that computing $r_t(\mathcal{T})$, for sets $\mathcal{T}$
consisting of two binary trees on the same set of taxa, is NP-hard and APX-hard. □



It follows directly that the following two problems are NP-hard and APX-hard. 
MinRetClusters 

Instance: A set X of taxa and a set C of clusters on X. 

Objective: Construct a phylogenetic network on X that represents each cluster in C and has a 
minimum number of reticulations over all such networks. 

MinRetTriplets

Instance: A set $X$ of taxa and a set $\mathcal{R}$ of triplets on $X$.

Objective: Construct a phylogenetic network on $X$ that is consistent with each triplet in
$\mathcal{R}$ and has a minimum number of reticulations over all such networks.

Moreover, the latter problem is even NP-hard and APX-hard for dense sets of triplets. This 
strengthens a result by Jansson et al. [HI, who showed that MinRetTriplets and MinLevTriplets 
are NP-hard, by constructing a non-dense set of triplets such that positive instances of the NP- 
complete problem Set Splitting corresponded to a level-1 network with exactly one reticulation. 
Corollary [3] extends this result by showing that MinRetTriplets is even NP-hard for dense sets 
of triplets and that it is hard to approximate (APX-hard). 

We now turn our attention to the problems that minimize level. 

Theorem 3. Computing $\ell_t(\mathcal{T})$ is NP-hard and APX-hard, even for sets $\mathcal{T}$
consisting of two binary trees on the same set of taxa.

Proof. We again reduce from the problem of computing $r_t(\mathcal{T})$, for sets $\mathcal{T}$ consisting
of two binary trees on the same set of taxa. We first reduce this problem to its restriction to pairs of
trees $T_1$, $T_2$ that do not have a common non-singleton cluster. Call this restricted problem
ResMinRetTrees.

Consider a set $\mathcal{T}$ consisting of two binary phylogenetic trees $T_1$, $T_2$ on a set $X$ of taxa.
Recall Theorem 1 of Baroni et al. [1] and the application of it described in the proof of Theorem 1
in this article. To summarise,
$r_t(T_1, T_2) = r_t(T_1|C, T_2|C) + r_t(T_1^{C \to c}, T_2^{C \to c})$ whenever
$C \in Cl(T_1) \cap Cl(T_2)$. Thus, repeatedly applying the Baroni theorem, we obtain a collection of
at most polynomially-many instances of ResMinRetTrees such that the minimum reticulation
number of the original instance is equal to the sum of the minimum reticulation numbers of the
obtained instances of ResMinRetTrees. Thus, we can solve the original instance by solving each
instance of ResMinRetTrees. This completes the reduction.
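
One step of this decomposition can be illustrated as follows. The sketch finds the common non-singleton clusters of two trees and forms $T|C$ and $T^{C \to c}$ for one of them. The nested-tuple tree encoding and all function names are assumptions made for this example, and the code does not compute any reticulation numbers.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples with string leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def nonsingleton_clusters(tree, is_root=True):
    """Leaf sets of size >= 2 below every non-root node."""
    if isinstance(tree, str):
        return set()
    found = set() if is_root else {leaves(tree)}
    for child in tree:
        found |= nonsingleton_clusters(child, False)
    return found


def restrict(tree, cluster):
    """T|C: the subtree on taxon set C, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in cluster else None
    kept = [r for r in (restrict(child, cluster) for child in tree) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def collapse(tree, cluster, label):
    """T^{C -> c}: replace the subtree whose leaf set is C by a new leaf `label`."""
    if leaves(tree) == cluster:
        return label
    if isinstance(tree, str):
        return tree
    return tuple(collapse(child, cluster, label) for child in tree)


t1 = ((("a", "b"), "c"), "d")
t2 = ((("a", "b"), "d"), "c")
common = nonsingleton_clusters(t1) & nonsingleton_clusters(t2)
c = frozenset(("a", "b"))
print(common, restrict(t1, c), collapse(t1, c, "ab"), collapse(t2, c, "ab"))
```

Here the only common non-singleton cluster is $\{a, b\}$. After collapsing it, the two resulting three-taxon trees share no non-singleton cluster, so they form an instance of the restricted problem.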

We continue by reducing ResMinRetTrees to the problem of computing $\ell_t(\mathcal{T})$. Consider
an instance $(X, T_1, T_2)$ of ResMinRetTrees. Let $\mathcal{T} = \{T_1, T_2\}$. We will prove that
$\ell_t(\mathcal{T}) = r_t(\mathcal{T})$ and this will complete the reduction. Clearly
$\ell_t(\mathcal{T}) \le r_t(\mathcal{T})$. Suppose then for the sake of contradiction that
$\ell_t(\mathcal{T}) < r_t(\mathcal{T})$. If that is the case, then any level-$\ell_t(\mathcal{T})$ network
that displays $T_1$ and $T_2$ contains at least two nontrivial biconnected components. By Lemma 3 there
exists such a network $N$ that is binary. Since this network contains at least two nontrivial biconnected
components, it contains a cut-edge $e = (u, v)$ such that at least two taxa are reachable from $v$
(by a directed path) and at least one taxon is not. Define cluster $E$ to contain all taxa that are
reachable from $v$ in $N$. Thus, $|E| \ge 2$. $T_1$ and $T_2$ are both displayed by $N$ so, for
$i \in \{1, 2\}$, there is a subdivision of $T_i$ in $N$. Fix any such subdivision. So, each edge of $T_i$
maps to a directed path of one or more edges in $N$. Both subdivisions must pass through $(u, v)$ and it
thus follows that $E$ is a non-singleton cluster of both $T_1$ and $T_2$, giving us a contradiction. This
completes the NP-hardness proof.



To see that computing $\ell_t(\mathcal{T})$ is not only NP-hard but also APX-hard, observe that
ResMinRetTrees is APX-hard because (as shown above) $r_t(\mathcal{T})$ can be computed by simply adding up
the optima of polynomially-many instances of ResMinRetTrees. This additivity means that an
$\epsilon$-approximation to ResMinRetTrees yields an $\epsilon$-approximation for the problem of computing
$r_t(\mathcal{T})$. Combining this with the optimality-preserving reduction from ResMinRetTrees to the
problem of computing $\ell_t(\mathcal{T})$ described above gives the desired result. □

It follows directly that the following problem is NP-hard and APX-hard. 
MinLevTrees 

Instance: A set X of taxa and a set T of phylogenetic trees on X. 

Objective: Construct a level-$k$ phylogenetic network on $X$ that displays each tree in $\mathcal{T}$ and
such that $k$ is as small as possible.

Corollary 4. Computing $\ell_c(\mathcal{T})$ and computing $\ell_{tr}(\mathcal{T})$ are both NP-hard and
APX-hard, even for sets $\mathcal{T}$ consisting of two binary trees on the same set of taxa.

Proof. Follows from Theorem 2 and Theorem 3. □

Thus, also the following two problems are NP-hard and APX-hard. 
MinLevClusters 

Instance: A set X of taxa and a set C of clusters on X . 

Objective: Construct a level-$k$ phylogenetic network on $X$ that represents each cluster in $C$ and
such that $k$ is as small as possible.

MinLevTriplets

Instance: A set $X$ of taxa and a set $\mathcal{R}$ of triplets on $X$.

Objective: Construct a level-$k$ phylogenetic network on $X$ that is consistent with each triplet
in $\mathcal{R}$ and such that $k$ is as small as possible.

Moreover, the latter problem is even NP-hard and APX-hard for dense sets of triplets. 



4 Concluding Remarks 

In this article, we have proven an important unification result that shows that when computing 
the minimum number of reticulations (or minimum level) required to represent data obtained from 
two binary trees on the same taxon set, it does not matter whether one calculates this using trees, 
triplets or clusters. In the process of proving this, we have clarified a number of confusing issues 
in the literature. 

The unification result has the practical consequence that the two-tree case forms an interesting
benchmark for comparing the performance of different phylogenetic network software. It was already
empirically observed in [33], for example, that for a specific two-tree data set the independently
developed programs Cass (which takes clusters as input, and attempts to minimise level), PIRN (which
takes trees as input, and attempts to minimise the reticulation number) and HybridInterleave (which
takes two binary trees as input, and minimises the reticulation number) all returned the same optimum.
The intriguing possibility thus exists of creating hybrid software for the two-tree problem by combining
the best parts of several existing software packages. It should be noted, however, that the networks
achieving these optima are not always transferable. For example, a network attaining the minimum number
of reticulations under the cluster model does not automatically display both the trees.

It is also interesting to view our results next to other two-tree findings in the literature. Phillips
and Warnow [21] showed that, given a set of clusters coming from two trees, it is polynomial-time
solvable to find a phylogenetic tree consistent with a maximum number of clusters, while this
problem is NP-hard for three or more trees. Another interesting two-tree result was discovered by
Bordewich, Semple and Spillner [5]. They found a polynomial-time algorithm for finding an optimal
set of taxa that maximizes the weighted sum of the phylogenetic diversity across two phylogenetic
trees, while this problem is also NP-hard for three or more trees. It would be interesting to try to
identify general families of objective functions (i.e. optimization criteria) for which the two-tree
case is special.

On the other hand, we have shown that the tree, triplet and cluster models already start to
diverge for three binary trees on the same set of taxa. A natural follow-up question is thus: can
we predict under what circumstances the models significantly differ, and what does it say about
our choice of model if sometimes one model requires significantly more reticulations, or higher
level, than another? The "triplet ≤ cluster ≤ trees" inequality from Section 2.2 suggests that in
appropriate combinations existing software for triplets, clusters and trees could be used to develop
lower and upper bounds for each other, but under what circumstances are these bounds strong?